This notebook is reproduced from the following source: https://www.youtube.com/watch?v=opHDQzhO5Fw
Performning clustering on employment data.
library(dplyr)
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
Clean out some NA data and create new variable out of industry and minor_occupation, get the mean of nbr of people in those categories.
employed_tidy <- employed %>%
filter(!is.na(employ_n)) %>%
group_by(occupation = paste(industry, minor_occupation),
race_gender) %>%
summarise(n = mean(employ_n)) %>%
ungroup()
`summarise()` regrouping output by 'occupation' (override with `.groups` argument)
Re-shape the data getting it ready for k-means
library(broom)
package 㤼㸱broom㤼㸲 was built under R version 4.0.2
employment_demo <- employed_tidy %>%
filter(race_gender %in% c("Women", "Black or African American","Asian")) %>%
pivot_wider(names_from = race_gender, values_from = n, values_fill = 0) %>%
janitor::clean_names() %>%
left_join(employed_tidy %>% # join in total to the pivot wider results
filter(race_gender =="TOTAL") %>%
select(-race_gender) %>%
rename(total = n)) %>%
filter(total > 1e4) %>% # hard filter to get rid of some observations that have small n
mutate(across(c(asian, black_or_african_american, women), ~ . / total), # ~ function and . is the variables we have "asian" etc divided by total to create proportions
total = log(total), # log transform due to the spread in values
across(is.numeric, ~as.numeric(scale(.)))) %>% # use default scale() the as.numeric(scale(.)) we force it instead of being a matrix to numeric
mutate(occupation = snakecase::to_snake_case(occupation)) # clean up occupation values into snakecase
Joining, by = "occupation"
For all the different occupation categories to compare how similar these are in terms of these four metrics that are now logged proportions. K-mean is sensitive to how the numbers are centered and scaled.
Which occupations are the most like each other in terms of the demographic representation and in the total (nbr of working in the occupations all together)
Try k-means clustering
summary(employment_clust)
Length Class Mode
cluster 211 -none- numeric
centers 12 -none- numeric
totss 1 -none- numeric
withinss 3 -none- numeric
tot.withinss 1 -none- numeric
betweenss 1 -none- numeric
size 3 -none- numeric
iter 1 -none- numeric
ifault 1 -none- numeric
Use broom to tidy the result of kmeans
These are the three cluster centers in the four dimensional space for each of the dimensions and some additional information.
We can also do additonal things.

We can tell already that there is no clean separation of the clusters. No occupations are cleanly separated from each other. Is clustering the right approach for this? Maybe not.
We can test to at least find best number of clusters.
# check how did the clustering do by its measures in terms of fitting the data
glance(employment_clust)
Choosing k

If we had a data that has strong clusters we would see an elbow here. Tut we do not really see any good elbow point, maybe 5 is the best here but it is not clear.
We redo our clustering
This is an interactive html plot that gives us some more info about the observations. But we see no clear separations, so we can ask how appropriate kmeans is for this kind of data!?
LS0tDQp0aXRsZTogInR0X2VtcGxveW1lbnQiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQpUaGlzIG5vdGVib29rIGlzIHJlcHJvZHVjZWQgZnJvbSB0aGUgZm9sbG93aW5nIHNvdXJjZTogDQpodHRwczovL3d3dy55b3V0dWJlLmNvbS93YXRjaD92PW9wSERRemhPNUZ3DQoNClBlcmZvcm1uaW5nIGNsdXN0ZXJpbmcgb24gZW1wbG95bWVudCBkYXRhLg0KDQpgYGB7cn0NCmxpYnJhcnkodGlkeXZlcnNlKQ0KbGlicmFyeShkcGx5cikNCg0KIyByZWFkIGluIGRhdGEgZnJvbSB0dCBnaXRodWINCmVtcGxveWVkIDwtIHJlYWRyOjpyZWFkX2NzdigiaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL3Jmb3JkYXRhc2NpZW5jZS90aWR5dHVlc2RheS9tYXN0ZXIvZGF0YS8yMDIxLzIwMjEtMDItMjMvZW1wbG95ZWQuY3N2IikNCmBgYA0KQ2xlYW4gb3V0IHNvbWUgTkEgZGF0YSBhbmQgY3JlYXRlIG5ldyB2YXJpYWJsZSBvdXQgb2YgaW5kdXN0cnkgYW5kIG1pbm9yX29jY3VwYXRpb24sIGdldCB0aGUgbWVhbiBvZiBuYnIgb2YgcGVvcGxlIGluIHRob3NlIGNhdGVnb3JpZXMuDQoNCmBgYHtyfQ0KZW1wbG95ZWRfdGlkeSA8LSBlbXBsb3llZCAlPiUNCiAgZmlsdGVyKCFpcy5uYShlbXBsb3lfbikpICU+JQ0KICBncm91cF9ieShvY2N1cGF0aW9uID0gcGFzdGUoaW5kdXN0cnksIG1pbm9yX29jY3VwYXRpb24pLA0KICAgICAgICAgICByYWNlX2dlbmRlcikgJT4lDQogIHN1bW1hcmlzZShuID0gbWVhbihlbXBsb3lfbikpICU+JQ0KICB1bmdyb3VwKCkNCmBgYA0KUmUtc2hhcGUgdGhlIGRhdGEgZ2V0dGluZyBpdCByZWFkeSBmb3Igay1tZWFucw0KDQpgYGB7cn0NCmxpYnJhcnkodGlkeXIpDQpsaWJyYXJ5KGJyb29tKQ0KbGlicmFyeShnZ3Bsb3QyKQ0KbGlicmFyeShwdXJycikNCmxpYnJhcnkocGxvdGx5KQ0KYGBgDQoNCg0KYGBge3J9DQplbXBsb3llZF90aWR5ICU+JQ0KIHN1bW1hcnkoKQ0KDQplbXBsb3ltZW50X2RlbW8gPC0gZW1wbG95ZWRfdGlkeSAlPiUNCiAgZmlsdGVyKHJhY2VfZ2VuZGVyICVpbiUgYygiV29tZW4iLCAiQmxhY2sgb3IgQWZyaWNhbiBBbWVyaWNhbiIsIkFzaWFuIikpICU+JQ0KICBwaXZvdF93aWRlcihuYW1lc19mcm9tID0gcmFjZV9nZW5kZXIsIHZhbHVlc19mcm9tID0gbiwgdmFsdWVzX2ZpbGwgPSAwKSAlPiUNCiAgamFuaXRvcjo6Y2xlYW5fbmFtZXMoKSAlPiUNCiAgbGVmdF9qb2luKGVtcGxveWVkX3RpZHkgJT4lICAjIGpvaW4gaW4gdG90YWwgdG8gdGhlIHBpdm90IHdpZGVyIHJlc3VsdHMNCiAgICAgIGZpbHRlcihyYWNlX2dlbmRlciA9PSJUT1RBTCIpICU+JQ0KICAgICAgc2VsZWN0KC1yYWNlX2dlbmRlcikgJT4lDQogICAgICByZW5hbWUodG90YWwgPSBuKSkgJT4lDQogIGZpbHRlcih0b3RhbCA+IDFlNCkgJT4lICMgaGFyZCBmaWx0ZXIgdG8gZ2V0IHJpZCBvZiBzb21lIG9ic2VydmF0aW9ucyB0aGF0IGhhdmUgc21hbGwgbg0KICBtdXRhdGUoYWNyb3NzKGMoYXNpYW4sIGJsYWNrX29yX2FmcmljYW5fYW1lcmljYW4sIHdvbWVuKSwgfiAuIC8gdG90YWwpLCAjIH4gZnVuY3Rpb24gYW5kIC4gaXMgdGhlIHZhcmlhYmxlcyB3ZSBoYXZlICJhc2lhbiIgZXRjIGRpdmlkZWQgYnkgdG90YWwgdG8gY3JlYXRlIHByb3BvcnRpb25zDQogICAgICAgICB0b3RhbCA9IGxvZyh0b3RhbCksICMgbG9nIHRyYW5zZm9ybSBkdWUgdG8gdGhlIHNwcmVhZCBpbiB2YWx1ZXMNCiAgICAgICAgIGFjcm9zcyhpcy5udW1lcmljLCB+YXMubnVtZXJpYyhzY2FsZSguKSkpKSAlPiUgIyB1c2UgZGVmYXVsdCBzY2FsZSgpIHRoZSBhcy5udW1lcmljKHNjYWxlKC4pKSB3ZSBmb3JjZSBpdCBpbnN0ZWFkIG9mIGJlaW5nIGEgbWF0cml4IHRvIG51bWVyaWMNCiAgICAgICAgIG11dGF0ZShvY2N1cGF0aW9uID0gc25ha2VjYXNlOjp0b19zbmFrZV9jYXNlKG9jY3VwYXRpb24pKSAjIGNsZWFuIHVwIG9jY3VwYXRpb24gdmFsdWVzIGludG8gc25ha2VjYXNlDQpgYGANCkZvciBhbGwgdGhlIGRpZmZlcmVudCBvY2N1cGF0aW9uIGNhdGVnb3JpZXMgdG8gY29tcGFyZSBob3cgc2ltaWxhciB0aGVzZSBhcmUgaW4gdGVybXMgb2YgdGhlc2UgZm91ciBtZXRyaWNzIHRoYXQgYXJlIG5vdyBsb2dnZWQgcHJvcG9ydGlvbnMuDQpLLW1lYW4gaXMgc2Vuc2l0aXZlIHRvIGhvdyB0aGUgbnVtYmVycyBhcmUgY2VudGVyZWQgYW5kIHNjYWxlZC4NCg0KYGBge3J9DQplbXBsb3ltZW50X2RlbW8gJT4lIGFycmFuZ2UoLXdvbWVuKQ0KYGBgDQoNCldoaWNoIG9jY3VwYXRpb25zIGFyZSB0aGUgbW9zdCBsaWtlIGVhY2ggb3RoZXIgaW4gdGVybXMgb2YgdGhlIGRlbW9ncmFwaGljIHJlcHJlc2VudGF0aW9uIGFuZCBpbiB0aGUgdG90YWwgKG5iciBvZiB3b3JraW5nIGluIHRoZSBvY2N1cGF0aW9ucyBhbGwgdG9nZXRoZXIpDQoNCiMjIFRyeSBrLW1lYW5zIGNsdXN0ZXJpbmcNCg0KYGBge3J9DQojIHRyeSB3aXRoIHNvbWUgZGVmYXVsdCBjZW50ZXJzIGZpcnN0DQplbXBsb3ltZW50X2NsdXN0IDwtIGttZWFucyhzZWxlY3QoZW1wbG95bWVudF9kZW1vLCAtb2NjdXBhdGlvbiksIGNlbnRlcnMgPSAzKQ0Kc3VtbWFyeShlbXBsb3ltZW50X2NsdXN0KSAjIGxpc3Qgb2Ygc3R1ZmYNCmBgYA0KVXNlIGJyb29tIHRvIHRpZHkgdGhlIHJlc3VsdCBvZiBrbWVhbnMNCmBgYHtyfQ0KdGlkeShlbXBsb3ltZW50X2NsdXN0KQ0KYGBgDQpUaGVzZSBhcmUgdGhlIHRocmVlIGNsdXN0ZXIgY2VudGVycyBpbiB0aGUgZm91ciBkaW1lbnNpb25hbCBzcGFjZSBmb3IgZWFjaCBvZiB0aGUgZGltZW5zaW9ucyBhbmQgc29tZSBhZGRpdGlvbmFsIGluZm9ybWF0aW9uLg0KDQpXZSBjYW4gYWxzbyBkbyBhZGRpdG9uYWwgdGhpbmdzLg0KYGBge3J9DQphdWdtZW50KGVtcGxveW1lbnRfY2x1c3QsIGVtcGxveW1lbnRfZGVtbykgJT4lICMgY29tcGFyZSB0aGUgY2x1c3RlcmluZyBkYXRhIHdpdGggdGhlIGFjdHVhbCBkYXRhIGFuZCBwbG90IGl0DQogIGdncGxvdChhZXModG90YWwsIHkgPSBibGFja19vcl9hZnJpY2FuX2FtZXJpY2FuLCBjb2xvciA9IC5jbHVzdGVyKSkgKw0KICBnZW9tX3BvaW50KGFscGhhID0gMC44KQ0KDQpgYGANCldlIGNhbiB0ZWxsIGFscmVhZHkgdGhhdCB0aGVyZSBpcyBubyBjbGVhbiBzZXBhcmF0aW9uIG9mIHRoZSBjbHVzdGVycy4gTm8gb2NjdXBhdGlvbnMgYXJlIGNsZWFubHkgc2VwYXJhdGVkIGZyb20gZWFjaCBvdGhlci4gSXMgY2x1c3RlcmluZyB0aGUgcmlnaHQgYXBwcm9hY2ggZm9yIHRoaXM/IE1heWJlIG5vdC4NCg0KV2UgY2FuIHRlc3QgdG8gYXQgbGVhc3QgZmluZCBiZXN0IG51bWJlciBvZiBjbHVzdGVycy4NCg0KYGBge3J9DQojIGNoZWNrIGhvdyBkaWQgdGhlIGNsdXN0ZXJpbmcgZG8gYnkgaXRzIG1lYXN1cmVzIGluIHRlcm1zIG9mIGZpdHRpbmcgdGhlIGRhdGENCmdsYW5jZShlbXBsb3ltZW50X2NsdXN0KQ0KYGBgDQoNCg0KIyMgQ2hvb3Npbmcgaw0KDQpgYGB7cn0NCmtjbHVzdHMgPC0gDQogIHRpYmJsZShrID0gMTo5KSAlPiUNCiAgbXV0YXRlKA0KICAgIGtjbHVzdCA9IG1hcChrLCB+IGttZWFucyhzZWxlY3QoZW1wbG95bWVudF9kZW1vLCAtb2NjdXBhdGlvbiksIC54KSksDQogICAgdGlkaWVkID0gbWFwKGtjbHVzdCwgdGlkeSksICMgd2UgZG9uJ3QgbmVlZCB0byBkbyB0aGlzIGJ1dCBpdCBpcyBuaWNlDQogICAgZ2xhbmNlZCA9IG1hcChrY2x1c3QsIGdsYW5jZSksDQogICAgYWd1bWVudGVkID0gbWFwKGtjbHVzdCwgYXVnbWVudCwgZW1wbG95bWVudF9kZW1vKQ0KICApDQoNCmtjbHVzdHMgJT4lICMgbmVzdGVkIGxpc3RzLCB3ZSB3YW50IHRvIHVubmVzdCB0aGlzIGdsYW5jZWQgdG8gZ2V0IHRoZSB2YWx1ZXMgc2hvd2luZyBob3cgd2VsbCB3ZSBkaWQgZG8NCiAgdW5uZXN0KGdsYW5jZWQpICU+JQ0KICBnZ3Bsb3QoYWVzKHggPSBrLCB5ID0gdG90LndpdGhpbnNzKSkgKyANCiAgZ2VvbV9saW5lKGFscGhhID0gMC44KSArDQogIGdlb21fcG9pbnQoc2l6ZSA9IDIpDQpgYGANCklmIHdlIGhhZCBhIGRhdGEgdGhhdCBoYXMgc3Ryb25nIGNsdXN0ZXJzIHdlIHdvdWxkIHNlZSBhbiBlbGJvdyBoZXJlLiBUdXQgd2UgZG8gbm90IHJlYWxseSBzZWUgYW55IGdvb2QgZWxib3cgcG9pbnQsIG1heWJlIDUgaXMgdGhlIGJlc3QgaGVyZSBidXQgaXQgaXMgbm90IGNsZWFyLg0KDQpXZSByZWRvIG91ciBjbHVzdGVyaW5nDQoNCmBgYHtyfQ0KIyByZSBydW4gdGhlIGNsdXN0ZXJpbmcgd2l0aCBrIDUsIG1ha2UgYSBpbnRlcmFjdGl2ZSBwbG90DQplbXBsb3ltZW50X2NsdXN0IDwtIGttZWFucyhzZWxlY3QoZW1wbG95bWVudF9kZW1vLCAtb2NjdXBhdGlvbiksIGNlbnRlcnMgPSA1KQ0Kc3VtbWFyeShlbXBsb3ltZW50X2NsdXN0KSAjIGxpc3Qgb2Ygc3R1ZmYNCg0KcCA8LSBhdWdtZW50KGVtcGxveW1lbnRfY2x1c3QsIGVtcGxveW1lbnRfZGVtbykgJT4lDQogIGdncGxvdChhZXModG90YWwsIGJsYWNrX29yX2FmcmljYW5fYW1lcmljYW4sIGNvbG9yID0gLmNsdXN0ZXIsIG5hbWUgPSBvY2N1cGF0aW9uKSkgKw0KICBnZW9tX3BvaW50KGFscGhhID0gMC44KQ0KDQpnZ3Bsb3RseShwKQ0KYGBgDQpUaGlzIGlzIGFuIGludGVyYWN0aXZlIGh0bWwgcGxvdCB0aGF0IGdpdmVzIHVzIHNvbWUgbW9yZSBpbmZvIGFib3V0IHRoZSBvYnNlcnZhdGlvbnMuDQpCdXQgd2Ugc2VlIG5vIGNsZWFyIHNlcGFyYXRpb25zLCBzbyB3ZSBjYW4gYXNrIGhvdyBhcHByb3ByaWF0ZSBrbWVhbnMgaXMgZm9yIHRoaXMga2luZCBvZiBkYXRhIT8NCg0KDQo=